One way to characterize the compact structures of lattice protein model*

Bin Wang^1, Zu-Guo Yu^{1,2}

^1 Institute of Theoretical Physics, Chinese Academy of Sciences,
P.O. Box 2735, Beijing 100080, P. R. China.

^2 Department of Mathematics, Xiangtan University,
Hunan 411105, P. R. China

February 2, 2008

Abstract

In the study of protein folding, our understanding of protein structures is limited.
In this paper we present one way to characterize the compact structures of the lattice protein
model. A quantity called Partnum is assigned to each compact structure. The Partnum is compared
with the recently introduced concept of the Designability of protein structures. It is shown that
highly designable structures have, on average, an atypical number of local degrees of freedom.
The statistical properties of Partnum and their dependence on sequence length are also studied.

1 Introduction

The study of protein folding is fundamental in both theory and application. To tackle the
protein folding problem physically, it is important to pay close attention to concrete proteins
and to consider the details of their interactions, for example for medical purposes. But there
are also "global views" that should be noticed. For example, the number of possible
configurations of folded proteins is enormous, while the number observed in living organisms is
rather limited. These protein structures can generally be described as belonging to a limited
number of families. In each family, ignoring the details, the proteins possess similar overall
conformations, and in many cases the structures show regular forms or approximate symmetry.
[1, 2, 3, 4, 5, 6] Another example is that single domain proteins are observed only within a
certain range of sequence length: the number of amino acid residues in single domain proteins
seldom exceeds 200. Larger proteins usually fold into multi-domain native states. [?]

With the accumulation of knowledge about the structures and functions of proteins, it was
found that many proteins of similar structure perform completely different functions, while
proteins with different tertiary structures may perform similar functions. This suggests that to
understand the protein folding problem physically, one should first get to know the properties
of protein structures. [?] Based on concepts from the physics of spin glasses, studies show
that to fold efficiently, proteins require a specially shaped energy landscape resembling a
funnel. A heteropolymer with a completely random sequence generically possesses a rugged energy
landscape without a funnel. [?, ?] Goldstein et al. [10, 11] have worked on optimizing energy
functions for protein structure prediction. They found that some structures are more optimizable
than others, i.e., there exist structures for which the funneled energy landscape can be obtained
within a wide range of interaction parameters, while for some other structures the parameters
for fast folding are much more restricted. The funneled landscape theory argues that the
interactions in the folded structure must act in concert more effectively than expected in most
random cases. [12] Accordingly, compared with most other structures, the advantage of a highly
optimizable structure should be that its geometric arrangement permits more sequences to reach
the states of concerted interaction.



Other studies on the thermodynamics of lattice protein models also support the above idea.
[13, 14, 15, 16] In the lattice HP models, a protein is represented by a self-avoiding chain of
beads placed on a discrete lattice, with two types of beads: Polar (P) and Hydrophobic (H).
A sequence is specified by a choice of monomer type at each position on the chain, $\{x_i\}$,
where $x_i$ is either H or P and $i$ is the monomer index. A structure is specified by a set of
coordinates for all the monomers, $\{r_i\}$. The energy has the form

$$E = \sum_{i<j} E_{x_i x_j}\, \Delta(r_i - r_j),$$

where $\Delta(r_i - r_j) = 1$ when $r_i$ and $r_j$ are adjoining lattice sites but $i$ and $j$
are not adjacent along the sequence, and $\Delta(r_i - r_j) = 0$ in all other cases. The
interaction parameters $E_{x_i x_j}$ differ according to the contact type: HH, HP, or PP. Given
the interaction parameters, it is possible to find the ground state structure(s) of each
sequence. Studies show that structures differ markedly in their tendency to be chosen by
sequences as their unique ground states. The number of sequences that choose a structure as
their unique ground state is called the Designability of this structure. It was argued that only
highly designable structures are thermodynamically stable and stable against mutation, and thus
can be chosen by nature to fulfill the duties of life. [?] Though the interaction parameters
used may differ strongly between studies, the most highly designed structures do not depend
strongly on the details of the interactions. [13, 15, 16]
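The energy function above can be illustrated with a minimal Python sketch. The function name `hp_energy`, the coordinate convention and the numerical contact energies are assumptions made for illustration only; the text does not fix particular parameter values.

```python
from itertools import combinations

# Illustrative contact energies. These numbers are an assumption of this sketch,
# not parameters taken from the text.
CONTACT_ENERGY = {
    ("H", "H"): -2.3,
    ("H", "P"): -1.0,
    ("P", "H"): -1.0,
    ("P", "P"): 0.0,
}


def hp_energy(sequence, coords, contact_energy=CONTACT_ENERGY):
    """Sum E_{x_i x_j} over pairs that are lattice neighbours but not chain neighbours."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and structure must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("structure is not self-avoiding")
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        if j - i == 1:
            continue  # adjacent along the sequence: Delta = 0
        (xi, yi), (xj, yj) = coords[i], coords[j]
        if abs(xi - xj) + abs(yi - yj) == 1:  # adjoining lattice sites: Delta = 1
            energy += contact_energy[(sequence[i], sequence[j])]
    return energy


# A compact walk on the 3 x 3 lattice, folded by an all-H homopolymer.
walk = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (1, 1), (0, 1), (0, 2)]
print(hp_energy("H" * 9, walk))
```

Every compact walk on the 3 × 3 lattice has the same number of non-bonded contacts (12 lattice bonds minus 8 chain bonds), so a homopolymer gives the same energy on each of them.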

From the above discussion we see that it is essential to investigate the protein folding
problem from a structural point of view. To see the problem more clearly, we take the square
lattice HP model as an example. The total number of the most compact structures of a 36-bead
chain is 57337. [14] Consider a 36-bead homopolymer with interaction parameter
$E_{x_i x_j} = E_0 < 0$. All 57337 structures give the same energy when such a homopolymer
folds onto each of them. Therefore the folded energy cannot be used to distinguish the compact
structures from each other. The essential problem here is one of discrimination, or
characterization: to give ways to tell how and why structures differ from each other. Nature's
way to break the symmetry is to replace the homopolymer with a heteropolymer. From this point of
view, the success of the lattice protein model is that it helps to reveal this secret of nature.



Studies focusing on the properties of protein structures are still lacking, [17] in spite of
some recent work in this direction. [18, 19, 20] In this article we present one way to break the
symmetry, that is, to distinguish the compact structures of the lattice model without explicitly
considering a concrete form of interaction. However, since only compact structures are
considered here, a loose constraint is actually set on the interactions: they must be such that
compact structures are preferred as ground states. The method assigns a number called the
partition number (Partnum) to each compact structure through a simple process. The Partnums of
structures differ strongly, thus giving one way to distinguish them from each other.

In the following section we give the details of the method and compare the Partnum
with Designability. The statistical properties of Partnums are discussed in Section 3. The last
section contains some remarks.

2 The definition and interpretation of Partnum 



It is easy to find all the compact structures of a given chain length by computer. [21] Take
a 9-bead chain as an example. The search is self-avoiding and restricted to the 3 × 3 square
lattice shown in Fig. 1(A), and the resulting structures should not be related by rotation or
reflection symmetry. As a result, there are only three starting points, (0, 0), (0, 1) and
(1, 1), for the search of structures. To find the structures starting at (0, 0), the first step
is to go to (1, 0). This is the only choice, because (0, 1) is a symmetric point of (1, 0). We
give all the structures following this step a number $p_1 = \ln(1)$. Now go to the next site.
There are two possible choices: (2, 0) or (1, 1). Since the walk is self-avoiding and restricted
to the 3 × 3 lattice, the walk following a certain choice may fail to extend to a length of 9
beads. A choice that can be extended to 9 beads is called acceptable. Suppose that both (2, 0)
and (1, 1) are acceptable. Then each compact structure generated following
(0, 0) → (1, 0) → (2, 0) or (0, 0) → (1, 0) → (1, 1) is given a number $\ln(1/2)$ for this
step. Generally speaking, restricted to the 3 × 3 lattice and beginning at a starting point, it
takes 8 steps in total to finish a self-avoiding walk. Each step is given a number according to
the following rule: if the $i$-th step has in total $C$ acceptable choices that are not
symmetrically related, then the step is given a number called the partnum of the $i$-th step,
$p_i = \ln(1/C)$. For the 2D square lattice, the largest possible number of choices $C_0$ is 3.

Adding all 8 numbers and then dividing the sum by 8, we get the Partnum P1,

$$P1 = \frac{1}{8} \sum_{i=1}^{8} p_i .$$

Here the structure is actually oriented. The consideration of oriented walks is reasonable in
the case of protein structures, because the native protein structure would become unstable if
the sequence were reversed, and also because proteins in living organisms are produced
successively from one end to the other. However, if one considers the reversal of the start and
end of the walk as a symmetry operation, then one oriented walk and its reverse together
correspond to a structure that does not depend on the direction. In what follows, the terms
oriented walk and non-oriented structure are used to distinguish the two different ways of
viewing a structure, and the Partnums corresponding to them are denoted as P1 and P2,
respectively. When there is no need to distinguish them, simply structure is used and the
Partnum is denoted as P. For a non-oriented structure, the Partnum can be defined as
$P2 = P1(1) + P1(2)$, where $P1(1)$ is the Partnum of one of the two oriented walks and $P1(2)$
is that of its reverse.

The Partnums of structures of other chain lengths can be obtained similarly.
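The procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used for the results in this paper: symmetry-related walks are identified by keeping the lexicographically smallest of their eight images (a choice made here for convenience), and all function names are hypothetical. The number of acceptable, symmetry-inequivalent choices C at a step is then the number of distinct next sites among the retained walks that share the same path so far.

```python
from math import log


def symmetries(L):
    """The 8 rotations and reflections of an L x L square lattice."""
    m = L - 1
    return [
        lambda p: (p[0], p[1]),
        lambda p: (m - p[0], p[1]),
        lambda p: (p[0], m - p[1]),
        lambda p: (m - p[0], m - p[1]),
        lambda p: (p[1], p[0]),
        lambda p: (m - p[1], p[0]),
        lambda p: (p[1], m - p[0]),
        lambda p: (m - p[1], m - p[0]),
    ]


def canonical(walk, ops):
    """Lexicographically smallest image of a walk under the lattice symmetries."""
    return min(tuple(op(p) for p in walk) for op in ops)


def compact_walks(L):
    """All oriented compact walks on an L x L lattice, one per symmetry class."""
    ops = symmetries(L)
    found = set()

    def extend(walk, visited):
        if len(walk) == L * L:
            found.add(canonical(walk, ops))
            return
        x, y = walk[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < L and 0 <= nxt[1] < L and nxt not in visited:
                visited.add(nxt)
                walk.append(nxt)
                extend(walk, visited)
                walk.pop()
                visited.remove(nxt)

    for x in range(L):
        for y in range(L):
            extend([(x, y)], {(x, y)})
    return sorted(found)


def partnums(walks):
    """Map each oriented walk to P1, the mean over its steps of ln(1/C)."""
    children = {}
    for w in walks:
        for i in range(1, len(w)):
            # Sites that acceptable walks visit next after the path w[:i].
            children.setdefault(w[:i], set()).add(w[i])
    return {
        w: sum(log(1.0 / len(children[w[:i]])) for i in range(1, len(w))) / (len(w) - 1)
        for w in walks
    }


def partnum_nonoriented(walk, p1, ops):
    """P2 = P1(walk) + P1(reversed walk)."""
    return p1[walk] + p1[canonical(walk[::-1], ops)]


ops = symmetries(3)
walks = compact_walks(3)
p1 = partnums(walks)
for w in walks:
    print(w, round(p1[w], 4), round(partnum_nonoriented(w, p1, ops), 4))
```

For the 3 × 3 lattice this enumeration yields five oriented walks. Because the representative of each symmetry class is chosen lexicographically, the printed coordinates may be a rotated or reflected image of the walk drawn in Fig. 1(A); the Partnums do not depend on that choice.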

Since the original motivation for developing the Partnum is to account for the difference in
Designability between structures, in Fig. 2 we plot Designability against the Partnum of
oriented structures on the 5 × 5 lattice (the interaction parameters for calculating
Designability are the same as used in Ref. [?]). There is no strict correspondence between
Designability and Partnum. However, a linear fit of the data reveals that Designability tends to
increase with increasing Partnum (see Fig. 2). The same happens for other sequence lengths. In
the case of the 6 × 6 lattice, the structure with the highest Designability [?, ?] possesses the
second largest Partnum (P2).

According to Fig. 1(B), an oriented walk corresponds to one path from the root to a top
leaf of the hierarchical tree. The value of the Partnum of the structure is determined by how
frequently the path is interrupted by branches. If the path of a walk meets fewer branches, the
Partnum is larger. This can be compared with the conclusion in Ref. [16]. In Ref. [16] a simple
version of the HP model of proteins is employed. A walk is reduced to a string of 0s and 1s,
which represent surface and core sites respectively, as the backbone is traced. Each walk is
therefore associated with a point in a high dimensional space. Sequences are represented by
strings of their hydrophobicity and thus can be mapped into the same space. It was found that
walks far away from other walks in the high dimensional space are highly designable and
thermodynamically stable. For this reason, highly designable structures are called atypical in
Ref. [16]. Here the structures with large Partnum can also be called atypical (atypical average
local freedom), since these structures correspond to paths on the hierarchical tree with fewer
branches.

In analogy to the suggestion that nature selects only highly designable structures, we
assume that there exists a random process which selects only the structures with the largest
Partnum. It is interesting to see what this assumption results in.

For conciseness we assume a critical Partnum $P_c$, so that only a small portion of oriented
walks, those for which $P1 > P_c$, can be selected. Two oriented walks are called $n$-level
similar if their first $n-1$ steps are along the same path and they branch at the $n$-th step.
Suppose $s_1$ is among the structures with the highest Partnum, satisfying $P1(s_1) > P_c$. This
means that there are few branches along the path of $s_1$. As a result, it is difficult to find
walks which show high-level similarity to $s_1$. But if such walks do exist, they should have a
high probability of being selected. For example, if $s_2$ is $(N-1)$-level similar to $s_1$, $N$
being the chain length, then $P1(s_2) = P1(s_1) > P_c$. More generally, let $n_{12}$ be the
similarity level between $s_2$ and $s_1$. Since the two walks can differ only in the steps after
the branching point, we know that $P1(s_2) \geq P1(s_1) - (1 - \frac{n_{12}}{N}) \ln(C_0)$,
$C_0 = 3$ being the maximal number of possible choices per step during the search of structures.
According to this expression, the more similar $s_2$ is to $s_1$, the more likely it is to be
selected.
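The notion of $n$-level similarity can be made concrete with a short Python sketch. It assumes that each walk is given as a sequence of lattice sites, so that step $k$ leads from site $k-1$ to site $k$ (sites counted from 0); the helper names `similarity_level` and `families` are hypothetical.

```python
from collections import defaultdict


def similarity_level(walk_a, walk_b):
    """Return n such that the walks share their first n-1 steps and branch at step n.

    Returns 0 if the walks already start at different sites, and None if the
    walks are identical.
    """
    for n, (site_a, site_b) in enumerate(zip(walk_a, walk_b)):
        if site_a != site_b:
            return n
    return None


def families(walks, level):
    """Group walks that share their first `level` steps (sites 0..level)."""
    groups = defaultdict(list)
    for w in walks:
        groups[tuple(w[: level + 1])].append(w)
    return list(groups.values())


a = [(0, 0), (1, 0), (2, 0), (2, 1)]
b = [(0, 0), (1, 0), (2, 0), (1, 0)][:3] + [(2, 1)]
c = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(similarity_level(a, c), similarity_level(a, b))
```

Two walks placed in the same group by `families(walks, level)` have a similarity level greater than `level`, which is the sense in which members of one family are more similar to each other than to members of another.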

Assume that $s_3$ is another walk with $P1(s_3) > P_c$, but that it is dissimilar to $s_1$,
with similarity level $n_{13}$. From the above discussion we know that there are two families,
all members of which are selected. Within each family, the similarity level of two walks is much
higher than $n_{13}$, while any two walks from different families are dissimilar from each
other, with similarity level $n_{13}$. We thus come to the conclusion that the selected walks
belong to separate families: walks within each family are similar, while walks belonging to
different families are dissimilar.

For non-oriented structures, the convenience of the hierarchical tree is not available for
discussing their properties. But it is plausible that the above result still holds once the
similarity between structures is properly defined. This is the case for the classification of
real protein structures, where more or less arbitrary criteria [1, 2, 3, 4, 23, 24, 25] are used
to define the similarity between protein structures and to classify structures into families,
superfamilies, folds, and so on.

3 The statistical properties of Partnums 

Natural single domain proteins exist only within a limited range of sequence length. Both
theoretical and numerical studies in Ref. [26] show that the stability of folded sequences
against mutation decreases with increasing chain length. In what follows, the dependence of
the statistical properties of Partnum on chain length is discussed. We will show how some
structural properties are determined by general statistical principles.

The density distributions of P2 are shown in Fig. 4. Things are similar for P1. In both
cases, the distribution visibly becomes more and more normal as the chain length increases. In
fact, it will be shown that the distribution is Gaussian in the long chain limit. As a first
step, however, let us assume more generally that the Partnums of chain length $N$ can be
described by a density distribution function $F(P, v_1, v_2, \ldots)$, $v_i$ being the moment of
$i$-th order. It is easy to get the average $v_1 = \langle P \rangle$ and the variance
$v_2 = \Delta P$. The results for both oriented walks and non-oriented structures are shown in
Fig. 3.

Fig. 3 shows that $\langle P \rangle$ (both $\langle P1 \rangle$ and $\langle P2 \rangle$)
decreases with increasing chain length. However, from the definition of Partnum we know that
$\langle P1 \rangle$ ($\langle P2 \rangle$) cannot be smaller than $-\ln 3$ ($-2\ln 3$). So, for
either oriented walks or non-oriented structures, there must exist $\delta$ such that

$$\lim_{N \to \infty} \langle P \rangle = \delta.$$

A similar argument applies to $\Delta P$, where

$$\lim_{N \to \infty} \Delta P = \epsilon, \quad \epsilon \geq 0.$$

It is known that the total number of compact structures $M$ increases exponentially with the
chain length $N$: $M(N) \sim (C_{av})^N$, $C_{av} < C_0 = 3$ being the average number of
possible choices per step for the walks. This gives one way to estimate the value of $C_{av}$
using the knowledge of $M(N)$. Fig. 5 shows the fit of the data $\ln M(N)$ to
$f(N) = (\ln C_{av}) N + b$. The result is $C_{av} = 1.397$. Viewing this value of $C_{av}$ as
the value in the long chain limit, we get $\delta = \ln(1/C_{av}) = -0.3343$ for oriented walks,
a reasonable estimate (see Fig. 3). It should be noticed that the $C_{av}$ obtained this way is
much larger than that given by the mean field consideration, [21, 27] where
$C_{av} = C_0/e = 1.1$. According to this $C_{av}$, $\delta = -0.099$ for oriented walks. From
Fig. 3 we know that this value is too large to be the long chain limit of $\langle P1 \rangle$.
So it seems that the mean field treatment does not apply to the two dimensional protein model.
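The estimate of $C_{av}$ from a straight-line fit, and its comparison with the mean field value, can be sketched in Python as below. The walk counts used here are for square lattices only: 57337 for $N = 36$ is quoted in the text, while the other counts are commonly quoted values and are an assumption of this sketch. They are not the data set behind Fig. 5, so the fitted value is only expected to lie close to, not to coincide with, the value 1.397 reported above.

```python
from math import e, exp, log

# Number of oriented compact walks on L x L lattices, keyed by chain length N = L * L.
# Assumed input for illustration (see the lead-in); not the data of Fig. 5.
WALK_COUNTS = {4: 1, 9: 5, 16: 69, 25: 1081, 36: 57337}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


lengths = sorted(WALK_COUNTS)
slope, intercept = fit_line(lengths, [log(WALK_COUNTS[n]) for n in lengths])

c_av = exp(slope)        # ln M = (ln C_av) N + b, so C_av = exp(slope)
delta = -slope           # delta = ln(1 / C_av)

c_mean_field = 3 / e     # mean field value C_0 / e with C_0 = 3
delta_mean_field = -log(c_mean_field)

print(f"fit:        C_av = {c_av:.4f}, delta = {delta:.4f}, b = {intercept:.4f}")
print(f"mean field: C_av = {c_mean_field:.4f}, delta = {delta_mean_field:.4f}")
```

The mean field numbers follow directly from $C_0/e = 3/e \approx 1.10$ and $\ln(e/3) \approx -0.099$, which are the values compared with Fig. 3 in the text.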

With the help of the central limit theorem, we can argue that the density distribution is
Gaussian in the long chain limit, and that $\epsilon = 0$, as follows.

In the space of compact structures, the Partnum $P$ of a given structure is the average of
the partnums $p_i$ of all $T$ steps. For oriented walks $T$ equals the chain length minus 1, and
for non-oriented structures this value is doubled. Now divide the $T$ partnums into $T/n$ groups
(suppose $T/n$ is an integer). In each group the $n$ members are chosen randomly from the total
$T$ numbers. For each group we define a new random variable $q_k = \sum_i p_i / n$, $k$ being
the group index and the sum running over the members of the group. Since the members of each
group are chosen randomly from the total $T$ numbers, the $T/n$ newly defined random variables
should have the same average and variance when $n \to \infty$. At the same time, since
$P = \frac{1}{T/n} \sum_k q_k$, applying the central limit theorem [28] we know that $P$ is a
Gaussian random variable and that $\Delta P \to 0$ when $T/n \to \infty$.
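The shrinking of the variance can be illustrated numerically with a deliberately simplified Python sketch. It assumes that the step partnums are independent and drawn uniformly from the three allowed values $\ln(1)$, $\ln(1/2)$ and $\ln(1/3)$. This independence is an assumption of the sketch only; in real compact walks the steps are correlated and the three values are not equally likely.

```python
import random
from math import log
from statistics import pvariance

# Allowed partnums of a single step on the square lattice (C = 1, 2 or 3).
STEP_VALUES = [log(1.0), log(1 / 2), log(1 / 3)]


def sample_partnum(T, rng):
    """Average of T step partnums drawn independently (toy assumption)."""
    return sum(rng.choice(STEP_VALUES) for _ in range(T)) / T


def variance_of_partnum(T, samples=2000, seed=0):
    """Sample variance of the toy Partnum over many random 'walks' of T steps."""
    rng = random.Random(seed)
    return pvariance([sample_partnum(T, rng) for _ in range(samples)])


for T in (8, 24, 48, 99):
    print(T, variance_of_partnum(T))
```

Under this toy assumption the variance falls off as $1/T$, in line with the statement that $\Delta P \to 0$ in the long chain limit.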

From the above discussion we know that, according to their Partnums, all compact structures
become statistically indistinguishable in the long chain limit. Recalling the selection rule
assumed above, we see that it becomes increasingly difficult to select atypical structures as
the chain length increases. These results show some connection to the work of Ejtihadi et al.
[20] With a purely geometrical approach, they were able to greatly reduce the number of
candidate structures that can be chosen as the ground states of sequences. They found that for
the HP protein model the number of ground state candidates grows only as $N^2$, $N$ being the
sequence length, while, as pointed out above, the total number of compact structures increases
exponentially with $N$. So it becomes increasingly difficult to find the ground state
candidates. This is in accordance with the statistical properties of Partnum.

To fulfill their biological functions, proteins should possess certain properties, for example
fast folding, thermodynamic stability and stability against mutation. [12, 13, 26, ?, ?] It was
postulated that with increasing sequence length, it becomes more and more difficult for folded
structures to possess these properties. [26] Based on the study of Partnum, we propose that this
property of proteins is determined by the statistical properties of protein structures, with the
details of the interactions having only a weak influence.

4 Concluding Remarks

Protein structures seem to be a very special class among all the possible folded configurations
of a polypeptide chain. We now know something about how special they are, but little about why
this is so. Ways of characterizing folded structures, from whatever point of view, will help to
deepen our understanding of protein structures. In this paper, the study of Partnum is
interesting in itself, and more interesting when compared with the dynamic and thermodynamic
study of proteins. The concept of Partnum is simple and can only be applied to lattice models.
But its study reveals that it is possible to investigate protein structures without considering
the details of the interactions.

Figure 1: (A): The 3 × 3 square lattice used to find all the compact structures of a 9-bead
chain. The bold curve is an oriented walk starting at (0, 0) and ending at (0, 2). The arrows
show that instead of walking along the bold curve, one can find other structures in the
directions of the arrows. (B): The oriented walks and their branching pattern during the search.
Note that only some points show branching on the tree. Other branches are truncated because they
cannot extend to a length of 9 beads due to the restriction of lattice size and self-avoidance.
The numbers at the right of the figure show the steps of the search.

Figure 2: Points: Designability against Partnum of non-oriented compact structures of chain
length 25. Line: the curve $f(x) = ax + b$ with a = 488809 and b = 300344. The correlation
coefficient is r = 0.447, with 621 data points in total.

Figure 3: (A): The dependence of the average of Partnums $\langle P \rangle$ on chain length.
The upper line-points curve is for the oriented walks, and the lower line-points curve is for
the non-oriented structures. The upper and lower dotted straight lines
$\langle P \rangle = -0.3343$ and $\langle P \rangle = -0.6686$ are the estimated long chain
limits of $\langle P1 \rangle$ and $\langle P2 \rangle$, respectively (see text). (B): The
dependence of the variance of Partnums on chain length.

Figure 4: The density distributions of Partnums of non-oriented structures for various chain
lengths. The number "49" in "P2-49", for example, is the chain length. The distribution curves
are shown in step curve style.

Figure 5: Logarithm of the total number of oriented walks versus the chain length. The line
is the fit using $\ln M = \ln(C_{av}) \times N + b$, with $C_{av} = 1.3969$ and $b = -0.9489$.
The correlation coefficient is r = 0.99.